Research on a Weighted Gene Co-expression Network Analysis method for mining pathogenic genes in thyroid cancer

Thyroid cancer (TC) is one of the most common thyroid malignancies occurring worldwide, and accounts for about 1% of all the malignant tumors. It is one of the fastest growing tumor and can occur at any age, but it is more common in women. It is important to find the pathogenesis and treatment targets of TC. In this pursuit, the present study was envisaged to investigate the effective carcinogenic biological macromolecules, so as to provide a better understanding of the occurrence and development of TC. The clinical and gene expression data were collected from The Cancer Genome Atlas (TCGA). We clustered mRNA and long non-coding RNA (lncRNA) into different modules by Weighted Gene Co-expression Network Analysis (WGCNA), and calculated the correlation coefficient between the genes and clinical phenotypes. Using WGCNA, we identified the module with the highest correlation coefficient. Subsequently, by using the differential genes expression analysis to screen the differential micro-RNA (miRNA), the univariate Cox proportional hazard regression was employed to screen the hub genes related to overall survival (OS), with P < 0.05 as the statistical significance threshold. Finally, we designed a hub competitive endogenous RNA(ceRNA) network of disease-associated lncRNAs, miRNAs, and mRNAs. From the results of enrichment analysis, the association of these genes could be related to the occurrence and development of TC, and these hub RNAs can be valuable prognostic markers and therapeutic targets in TC.


Introduction
Thyroid cancer (TC) is the most common endocrine system malignant tumor, and is the endocrine tumor with the fastest growth rate in the world [1,2]. Although the overall survival rate of TC patients is better, the incidence rate is increasing significantly [3,4], and it occurs frequently in women [5]. The age standardized statistics show that, the global incidence of thyroid cancer was 2.2/100,000 men/year and 4.4/100,000 women/year. The incidence increased by 50% from 2006 to 2016, being the fastest increase among all malignant solid tumors [6]. In 2019, thyroid cancer ranked third among the female malignancies [7]. However, the overall survival rate is good, and most TCs can be effectively controlled by treatments such as surgery and endocrine suppression. Nevertheless, the mortality rate of advanced thyroid cancer still remains an important concern. Thus, it is very important to explore the potential molecular mechanisms involved in TC, so as to find effective diagnostic and therapeutic targets.
In the past few decades, the research on TC has mainly been based on molecular characteristics, morphological characteristics and immunological characteristics to explore the tumorassociated protein coding genes. However, in mammals, the protein-encoding RNA covers only 2% of the total genome. Therefore, it is necessary to discover the functions of non-coding RNA (ncRNA), including the long non-coding RNA (lncRNA) and the micro-RNA (miRNA) [8,9]. In the past, lncRNA was supposed to belong to the noise on the road of gene sequencing analysis, and the related research on ncRNA mainly focused on the miRNA. Subsequently, it was found that miRNAs could regulate the cell growth and apoptosis [10,11]. In 2005, several studies showed that miRNA were uniquely and differentially expressed in cancer tissues compared with normal adjacent tissues. For example, the expression of miRNA let-7 was downregulated in lung cancer but not in other cancers, such as breast or colon cancer [12,13]. These studies illustrate that the expression profiling data can serve as an effective diagnostic tool for the detection of biomarkers and cancer-related genes. With the development of high-throughput sequencing technology and recent studies on lncRNAs, it has been shown that lncRNAs are closely related to the occurrence and development of various diseases and tumors [14]. Some lncRNAs were used as important regulators to participate in the whole process of life activities, and even some lncRNAs were directly regarded as biomarkers in the clinical trials [15]. For example, lncRNA PCA3, which is highly upregulated and specific to prostate cancer, was detected in the urine with levels that correspond to the severity of prostate cancer [16]. In 2011, Sal-mena et al. [17] proposed the competitive endogenous RNA (ceRNA) hypothesis, wherein specific RNAs, such as lncRNA, circular RNA (circRNA) and pseudogenes competed with the miRNA through miRNA response element (MRE), thereby affecting the gene silencing caused by miRNA. Therefore, the pathogenic mechanism of ceRNA has attracted a lot of attention. In many malignant tumors such as liver cancer and lung adenocarcinoma, the mechanism of tumor occurrence and development caused by lncRNA-miRNA-mRNA has been elucidated [18,19]. Studies have reported that a novel ceRNA activity of lncRNA MEG3 in human gastric cancer cells, lncRNA MEG3 inhibits the cell proliferation, migration and invasion in gastric cancer by competitively binding the miR-181 family, upregulating Bcl-2, and suppressing gastric carcinogenesis [20]. The study of Zhang et al. reported that a ceRNA module was mediated by has-miR-145 and the has-miR-16 in normal state, while has-miR-16 was replaced by three miRNAs including has-miR-34a,has-miR-29b and has-miR-15a in the prostate tumor state [21]. Interestingly, these three replaced miRNAs in tumor state are relevant in the tumorigenesis of prostate cancer [22][23][24]. In summary, these observations highlight the means of analyzing the dynamic regulation of ceRNA interactions in the exploration of the mechanism of tumorigenesis.
In the present study, TC sample and para-cancer samples were procured from The Cancer Genome Altas (TCGA) database, and were analyzed at the level of lncRNA, miRNA and mRNA. Weighted Gene Co-expression Network Analysis (WGCNA) was used to obtain the related lncRNAs and mRNAs with significant positive correlation to TC cancer. The significantly different miRNAs were screened out by the differential genes expression analysis. Then, the univariate Cox proportional hazard regression was employed to screen the hub mRNAs related to overall survival. Finally, a novel lncRNA-miRNA-mRNA ceRNA network was constructed for TC. These hub RNAs can be valuable prognostic markers and therapeutic targets in TC. This study comprehensively analysed the TC-related ceRNA network to allow a more convenient tool to explore the post-transcriptional regulatory mechanism underlying in TC and suggest novel biomarkers for the diagnosis, treatment, and prognosis of TC.

Data source
The TC RNA-seq expression profile of counts type and miRNA-seq of BCGSC miRNA profile were downloaded from TCGA (Project ID: TCGA-THCA, Data Release: 29.0, https://portal. gdc.cancer.gov/). Among them, the RNA-seq consisted of 560 samples, including 58 normal samples and 502 tumor samples, and the miRNA-seq consisted of 573 samples, including 57 normal samples and 516 tumor samples. At the same time, the clinical data of 506 patients with TC was downloaded. Since RNA-seq from TCGA contains lncRNA and mRNA, the annotation files were downloaded from Gencode (Data Release: 37, https://www. gencodegenes.org/) to annotate the RNA-seq and classify lncRNA and mRNA.

Dataset preprocessing
Initially, the RNA-seq data was annotated and classified through the annotation file downloaded by Gencode; the data consisted of 19563 mRNA and 15096 lncRNA. Because the mutated gene is usually considered as background noise, in order to reduce the impact of a small number of outliers on the overall data, we selected 5000 genes according to average expression >1 and the greatest median absolute deviation (MAD). Subsequently, we used the R language to draw a sample cluster map to see if there were any outliers and removed them. Finally, log(x+1) normalization was performed on the gene expression profiles, so that the upregulated or down-regulated genes were continuously distributed around 0.

Weighted Genes Co-expressing Network Analysis
Weighted Gene Co-expression Network Analysis (WGCNA) refers to a biology method used to describe the gene association patterns between different samples [25]. It is an algorithm based on high-throughput gene expression profiles. It is based on the connectivity of genes and the correlation between genes to cluster gene modules, analyzes the co-expression relationship between genes, and screens out the core genes closely associated with other gene [26]. Genes in the same module are considered to have high correlation, consistent expression level, and similar biological functions, which makes it helpful to explore the biological functions of the different modules [27]. In undirected network, the genes in the module are highly related, and in directed network, they are highly positively correlated. The undirected network was employed in this study.
We used the WGCNA-R package to calculate the independence and average connectivity of the different modules under different power values, and selected the value when the independence coefficient R 2 reached 0.8 for the first time. In this study, the power value for mRNA modules was screened out as 7, and the power value for lncRNA modules was screened out as 6. According to the internal connectivity and soft threshold of genes, mRNA and lncRNA were clustered into different co-expression modules. Further, we calculated the correlation between the module and the clinical phenotype, and selected the module with the greatest correlation coefficient.

Preliminary screening of related miRNAs
The redundant information of the miRNA data was removed, and only the gene expression data in the sample was retained and divided into two groups according to the clinical dominance of the sample, cancer group and normal group. Using "Limma" R package, we performed the differential genes expression analysis of miRNA expression profile, identified the genes that were differentially expressed in normal and cancer tissues. 79 miRNA were screened out exhibiting a significant correlation with TC at P<0.05.

Usage of databases
By WGCNA, we obtained the mRNA module and lncRNA module that exhibit a highly significant correlation. miRcode [28] was used to predict the targeted miRNA by lncRNA in the lncRNA module with significant correlations.Targetscan [29] and miRDB [30] were used to predict mRNAs associated with miRNA.

Clustering of mRNA co-expressing module
From the mRNA expressing profiles of 560 TC cases in TGCA database, 19563 mRNA were selected according to MAD within the top 5000 with an average expression > 1. The samples that were already metastasized were removed in the construction of co-expression modules using WGCNA. First, we performed the sample cluster analysis [31] on the data to see if there were outliers; the sample clustering is shown in Fig 1A. Next, we selected the power value which affects the independence and average connectivity, according to the result of Fig 1B and selected the power value that was suitable for the undirected networks. After testing, it was found that when the power value was 7, the degree of independence was > 0.8 for the first time and it best met the requirements of the scale-free network. We used the power value to construct a weighted gene co-expression network. Eight modules were divided by this network as shown by Fig 1C. The correlation analysis between the downloaded clinical data of TC from TCGA and the constructed modules showed that the brown module exhibited a significant correlation with the clinical phenotype of TC; the correlation coefficient is shown in Fig 1D. The brown module consisted of 573 genes (WGCNA-mRNA). As a result, it was initially believed that 573 mRNAs exhibited a significant correlation with the TC samples.

Clustering of LncRNA co-expressing module
LncRNA co-expressing modules was constructed using the same method of the mRNA coexpressing module. After pre-processing the lncRNA expression profiles, we clustered them hierarchically and observed the sample distribution; the cluster tree is shown in as Fig 2A. Subsequently, we chose a power value suitable for lncRNA expression profiles to construct a coexpression network. As shown in Fig 2B, the power value of 6 was found to be the most appropriate. Using this power value to construct a co-expressing network, and dividing the genes into modules, a total of 8 modules were obtained, as shown in Fig 2C. It can be seen from Fig  2D, the yellow module exhibited a highest positive correlation coefficient with TC, and the module contains 220 genes (WGCNA-lncRNA).

Preliminary screening of miRNA
Because gene mutations become the biggest noise in the analysis of cancer-causing genes, we removed the genes whose expression level was 0 in 75% of the samples, and log(x+1) normalized the data. Subsequently, we constructed a cluster heat map as shown in Fig 3A to observe the overall expression level of miRNAs. Finally, we performed the differential genes expression analysis on the pre-processed miRNA, and obtained 78 differential genes (DEG-miRNA); the volcano plot of DEG-miRNAs is shown in Fig 3B.

Prediction of target relationship base on network databases
Considering the large scale of the data, the database files were downloaded to the local for manual comparison, and we used the loop statement to screen them one by one, so as to obtain the common genes.
miRcode was used to predict the miRNA target of 220 WGCNA-lncRNAs, and we obtained 125 miRNA targets (Predict-miRNAs) that exhibited a binding with them. We then compared them with 78 DEG-miRNAs obtained from the differential genes expression analysis to get the intersection, and constructed the Wayne diagram as shown in Fig 4A. From the figure, we got 8 miRNA.
TargetScan and mirDB were used to predict the target mRNA of miRNA. In order to better confirm the accuracy of the prediction results, we selected the mRNA from the intersection of the two database predictions, and 2116 mRNA were obtained. Similarly, we compared these 2116 mRNAs with 573 WGCNA-mRNAs, and as shown in the Fig 4B, we obtained 87 mRNA.

Univariate cox regression screening for hub-mRNAs
Univariate Cox proportional hazard regression model was used to screen out the preliminary related-mRNAs by verifing the correlation between the expression levels of these 87 common mRNAs and the overall survival (OS). The threshold P value was set to 0.05.
The R package "survival" was used to construct the univariate Cox proportional hazard regression model.As shown in Table 1, 17 genes with P value <0.05 were obtained from the univariate Cox proportional hazard regression model. Then, we divided the TC samples into high-risk and low-risk group and observed the overall distribution of the TC samples as shown in Fig 5A. To test the prediction accuracy of the model, we constructed the ROC curves (showed in Fig 5B). The survival rate of the high-risk group was significantly lower than that of the low-risk group (p = 0.0018), as shown in Fig 5C.

Construction of ceRNA network
From the comparison of the data from the experimental and database, we obtained 17 mRNAs, 19 lncRNAs and 5 miRNAs. In order to make the network structure more intuitive, the ceRNA network was performed and visualized as shown in Fig 6 using

KEGG pathway enrichment analysis
In order to further reveal the potential biological functions of mRNAs in TC, the KEGG Pathway enrichment analysis was performed on the17 mRNAs. For gene functional enrichment analysis, KEGG rest API (https://www.kegg.jp/kegg/rest/keggapi.html) was used to get the latest gene annotation, and used as a background to map the genes to the background set. Finally, the R package clusterProfiler (version 3.14.3) was employed in the enrichment analysis to obtain the result of genes enrichment [32]. Fig 7A and 7B show that a total of 31 KEGG pathways were enriched with 9 mRNA, including some tumor-related, such as PI3K-Akt signaling pathway, microRNAs in cancer, mTOR signling pathway and so on.

Survival analysis
In the present study,the R package "Survival" was employed, and we integrated the survival time, survival status and gene expression data in the dataset. The prognostic significance of all mRNA, miRNA in the ceRNA network and overall survival was evaluated using P < 0.05 as the cut-off threshold. Fig 8 reveals the Kaplan-Meier survival curves, which show that 4 mRNAs (DDIT4, ENTPD1, LDLR, MIDN) and 1 miRNA (miR-125a-5p) exhibited a significant impact on the prognosis of TC patients.
Considering that the prognosis of different types of thyroid cancer is quite different, all cancer samples were divided into four parts based on TNM stage (2017 AJCC, 8 th ed.). After sorting out, we found that stage I, II and III belonged to the differentiated stage, and stage IV belonged to the anaplastic stage. Survival analysis was performed in four parts. The results are shown in the Fig 9, and it can be seen that the survival probability of stage IV was significantly lower than stage I, II, III. From these results, it is clear that the anaplastic thyroid cancer has a worse prognosis than the differentiated thyroid cancer. Based on the above results, we defined stage IV as the high-risk group and stage I, II and III as the low-risk group. Subsequently, we made two groups of cancer samples of equal size with high risk and low risk, and consteructed the violin and scatter plots combined with gene expression data; the results are shown in Fig  10. Onobserving the overall expression levels of the ceRNA internal genes in the two groups of samples, it can be found that most of the key genes in the high-risk group were much higher than those in the low-risk group. To a certain extent, this reflects the role of key genes in the different stages of TNM, that is to say, the overexpression of key genes greatly reduced the overall survival rate of patients.
The overall flow of the experiment is shown in the Fig 11.

Discussion
Thyroid cancer (TC), as the most common endocrine system malignant tumor, is mainly prevalent in the children and teenagers [4]. The incidence in women is about 3 times that of the incidence in men [33]. Although the death rate of TC is lower than many other cancers, the death rate in the terminal stage of TC is not optimistic. Therefore, it is very essential to study and find the potential pathogenic mechanisms and therapeutic targets of TC. At present, the main method of clinical research on malignant tumors is gene sequencing, wherein the tumor samples are sequenced to determine whether there are any genetic mutations and accordingly confirm the target treatment plan. Numerous recent studies have proved that lncRNA can competitivelly bind the miRNA to regulate the expression of the target genes based on the ceRNA hypothesis [34,35]. These confirmations provide a reference for exploreing the mechanism of lncRNA, miRNA and mRNA in TC. In the past, ther research on TC was mainly focused on mRNA, and there were few studies on lncRNA [36]. Although a few

PLOS ONE
studies have shown that there are some genes that are significantly differentially expressed in TC, most of them have not shown the mechanism of these genes in the process of the tumor development. This phenomenon has been alleviated with the development of high-throughput sequencing technology, but correspondingly, the development of high-throughput sequencing technology generates a large amount of sequencing data, which ist difficult for subsequent analysis and interpretation of biological problems. Recent studies have shown that the ceRNA network in TC plays a regulatory role and corresponding mechanisms in the development of cancer, such as lncRNA MALAT1 function as a ceRNA via sponging miR-200a-3p to derepress the FOXA1 expression [37]. Further, linc01278 is known to sponge miR-376c-3p to positively regulate the DNM3 expression and ultimately acting as a tumor suppressor gene in papillary thyroid carcinoma [38]. SNHG3 binds to the miR-214-3p target in regulating the proteasome 26S subunit non-ATPase expression [39]. These studies have proved that there are pathogenic mechanisms rellated to the regulation of ceRNA in TC. However, there is a need of additional studies to systematically explore the internal regulatory mechanism of the ceRNA.

PLOS ONE
In this study, WGCNA was used to screen the modules of mRNA and lncRNA that showed a significant trend in the development of TC, including one brown module of mRNA and one yellow module of lncRNA. At the same time, the differential genes expression analysis was used to screen the different miRNAs. Using the interactions of lncRNA-miRNA-mRNA, we constructed a ceRNA network, and then we analyzed it by KEGG enrichment and survival analysis.
From the KEGG pathway enrichment analysis, we found that 17 mRNAs in the ceRNA network were mainly enriched in 31 pathways. Among these pathways, lipid metabolism is one of the TC characteristics [40]. In addition, many molecular signaling pathways of these pathways play an important role in the development of tumors, such as PI3K-Akt signaling pathway and the mTOR signaling pathway [25]. Many other pathways are enriched in different cancer

PLOS ONE
pathways and miRNA pathways related to cancer. Thus, these genes may promote the development of TC through ceRNA.
According to relevant literature, miRNAs in the ceRNA network is involved in the regulation of cancer occurrence and development, such as miR-206 can inhibit the proliferation and invasion of thyroid cancer by targeting RAP1B [41]. miR-139-5p not only plays an important part in tumorigensis processes, but also serves as a biomarker with sufficient sensitivity and specificity in the diagnosis of cancer in clinical setting [42]. Zheng et al. found that lncRNA DANCR plays an oncogenic role in the progression of tongue squamous cell carcinomavia

PLOS ONE
targeting miR-135-5p [43]. miR-140-5p suppresses the tumor growth and metastasis by inhibiting the transforming growth factor beta (TGF-β) and mitogen-activated protein kinase/extracellular signal-regulated kinase (MAPK/ERK) signaling [44]. Further, the miR-125a-5p levels in breast cancer can be a useful prognostic biomarker and offer a novel therapeutic avenue by targeting the HDAC4 in breast cancer [45].
In this study, the Kaplan-Meier survival curves showed that 4 mRNA and 1 miRNA exhibited a noteworthy impact in the prognosis of the patients. Among these genes, DDIT4 activates the mTOR signaling pathway to promete the glioma progression [46]. ENTPD1 serves as a metabolism-related gene and plays a critical role in TC through regulating metabolic pathways [47]. Mutations in the LDLR gene can cause abnormal lipoprotein metabolism, and the expression imbalance of LDLR may be related to the development of cancer [48]. MIDN promotes the expressing of parking E3 ubiquitin ligase, and MIDN loss can trigger Parkinson's disease [49]. In addition, from to the database comparison process and ceRNA network visualization process, we can know all the genes that are related to the corresponding miRNA. The above studies indicate that these genes may be used as potential biomarkers of TC, which can be beneficial in the diagnosis and treatment of TC patients.

PLOS ONE
The aim of the present study was to find the genes closely related to the occurrence and development of TC. However, considering that most of the genes and diseases are not directly related, we conducted our study based from the perspective of ceRNA network and then elucidated the mechanism by which these genes interact to cause cancer. At the same time, we combined the clinical data to study the influence of these genes on the prognosis of TC patients to make the study more meaningful in a clinical setting. Although this study highlights the prognostic value of this hub ceRNA network, there are certain limitations of this study. First, as the sample data was limited, some genes in this ceRNA network were found to be regulators in other cancers. However, there was no significant difference in the TC patient survival, when a few genes overexpressed or underexpressed in the TC samples. At the same time, our research was limited considering the relationship between the known genes. In order to solve these problems, computing models can play a vital role, which needs a deeper study. Many studies have used known interactions between the genes to predict human diseaserelated ncRNAs. For example, Chen et al. [50] used the known miRNA similarity and disease similarity to construct the miRNA-disease associations, in order to predict the diseaserelated miRNAs. Follow-up studies have focused on lncRNA-miRNA interactions, such as Zhang et al. [51] applied a semi supervised interactome network-based approach to explore and forecast the latent interaction between lncRNAs and miRNAs. Zhang et al. [52] developed a network distance analysis model for the lncRNA-miRNA association prediction (NDALMA). Similarity networks for lncRNAs and miRNAs were calculated and integrated using the Gaussian interaction profile (GIP) kernel similarity. Similarly, Liu et al. [53] established a novel matrix factorization model to predict the lncRNA-miRNA interactions, namely the lncRNA-miRNA interactions prediction by logistic matrix factorization with neighborhood regularized (LMFNRLMI). Wang et al. [54] conducted analysis based on cir-cRNA, and used the known associations of various diseases to build computational models to predict more potential associations. And some studies [55] are to improve the existing algorithms to make them perform better in predicting gene-disease association. The advantage of these studies is that they were not limited to a single disease and individual type gene, and a large amount of useful information can be mined, with high reliability and stability. In future studies, we can combine the known gene-disease association, as well as the gene-gene association, and use multivariate datasets to build computational models to predict whether there is an association between the lncRNA-miRNA-mRNA gene pairs and diseases. The purpose of this study was to committed to the development of medicine, we first assume that all the samples there is no abnormal clinical data, but still exist in the process of data processing the phenomenon of lack of some clinical data, we eliminate the abnormal data, at the same time we analyze the factors considered in the process of too little, the subsequent will consider into more clinical factors [56].

Conclusions
The present study established the lncRNA-miRNA-mRNA network in TC to explore the pathogenic mechanism of ceRNA that may be related in the development of TC. The results showed that the genes contained in the ceRNA network were significantly associated with patients' survival and TNM staging, which could help doctors make clinical treatment decisions for patients. Meanwhile, our research process could be evaluated at a lower cost than conventional gene screening methods such as gene sequencing.
The ceRNA network was established by multiple screening models, yielding network components with high prognostic value for TC. More importantly, it comprehensively elaborates the post-transcriptional regulatory mechanisms involved in TC and the ceRNA network proposed in this study can provide a reference and direction in the clinical setting, and provides potential biomarkers in the diagnosis, treatment and prognosis of TC patients.